Vectorization

Vectorization is a crucial and basic step in stylometric research: it refers to the process of turning text into numbers. More precisely, it refers to the creation of the two-dimensional X matrix which we have been carelessly importing so far: in this matrix, the rows represent documents and the columns represent stylometric features, such as word frequencies. Vectorization is therefore closely related to feature extraction, i.e. determining which stylistic properties should be extracted from documents to arrive at a reliable corpus representation. While feature extraction has been a popular research topic in, for instance, authorship studies, there exist few reliable practical introductions to the topic. This is a shame: vectorization is a foundational preprocessing step in stylometry and it has a huge impact on all subsequent analytical steps. It is a pity that most papers are not very explicit about the preprocessing steps taken, so that many practical questions remain unanswered:

  • Was punctuation removed?
  • Were texts lowercased?
  • What about character n-grams across word boundaries?
  • Were pronouns deleted before or after calculating relative frequencies?
  • Was culling performed before or after segmenting texts into samples?
  • etc.

This chapter is therefore entirely devoted to this important topic, partially to help raise awareness. The course repository comes with a module called vectorization.py, which contains a Vectorizer class that was developed in the context of the work on pystyl. If you import this class, you can check out its documentation:


In [2]:
from vectorization import Vectorizer
help(Vectorizer)


Help on class Vectorizer in module vectorization:

class Vectorizer(builtins.object)
 |  Vectorize texts into a sparse, two-dimensional
 |  matrix.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, mfi=100, ngram_type='word', ngram_size=1, vocabulary=None, vector_space='tf', lowercase=True, min_df=0.0, max_df=1.0, ignore=[])
 |      Initialize the vectorizer by setting up a
 |      vectorization pipeline via sklearn as 
 |      `self.transformer`
 |      
 |      Parameters
 |      ----------
 |      mfi: int, default=100
 |          The nb of most frequent items (words or
 |          ngrams) to extract.
 |      ngram_type: str, default='word'
 |          Set the type of features to be used
 |          for the ngram extraction:
 |          - 'word': individual tokens
 |          - 'char': character ngrams
 |          - 'char_wb': character ngrams (but not
 |              across word boundaries)
 |      ngram_size: int, default=1
 |          The length of the ngrams to be extracted.
 |      vocabulary: list, default=None
 |          Vectorize using an existing vocabulary.
 |      vector_space: str, default='tf'
 |          Which vector space to use (see below).
 |          Must be one of: 'tf', 'tf_scaled',
 |          'tf_std', 'tf_idf', 'bin'.
 |      lowercase: boolean, default=True
 |          Whether or not to lowercase the input texts.
 |      min_df: float, default=0.0
 |          Proportion of documents in which a feature
 |          should minimally occur.
 |          Useful for 'culling', i.e. ignoring features
 |          which don't appear in enough texts.
 |      max_df: float, default=1.0
 |          Proportion of documents in which a feature
 |          should maximally occur.
 |          Useful to ignore high-frequency features
 |          which appear in too many texts.
 |      ignore: list(str), default=[]
 |          List of features to be ignored.
 |          Useful to manually remove e.g. stopwords or other
 |          unwanted items.
 |      
 |      Notes
 |      -----------
 |      The following vector space models are supported:
 |      - 'tf': simple relative term frequency model
 |      - 'tf_scaled': tf-model, but normalized using a MinMaxScaler
 |      - 'tf_std': tf-model, but normalized using a StdDevScaler
 |      - 'tf_idf': traditional tf-idf model
 |      - 'bin': binary model, only captures presence of features
 |  
 |  vectorize(self, texts)
 |      Vectorize input texts and store them in
 |      sparse format as `self.X`.
 |      
 |      Parameters
 |      ----------
 |      texts: 2D-list of strings
 |          The texts to be vectorized.
 |          Assumed untokenized input in the case of
 |          `ngram_type`='word', else expects
 |          contiguous strings.
 |      
 |      Returns
 |      ----------
 |      X: array-like, [n_texts, n_features]
 |          Vectorized texts in sparse format.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
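
Before we try it out, it helps to have a rough idea of what such a wrapper does internally. The snippet below is only a sketch (not the actual pystyl implementation): it chains sklearn's CountVectorizer with a normalization step into a single Pipeline; texts stands for a list of raw document strings, such as the corpus we will load in a moment.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# sketch: count the 100 most frequent word unigrams, then turn the raw
# counts into relative frequencies (cf. the 'tf' vector space above)
pipeline = Pipeline([
    ('count', CountVectorizer(max_features=100, analyzer='word', lowercase=True)),
    ('tf', TfidfTransformer(use_idf=False, norm='l1')),
])
# X = pipeline.fit_transform(texts)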

As the documentation shows, this vectorizer offers easy access to a variety of vectorization strategies. All this code is based on the sklearn library, but it seamlessly wraps the different modules that are needed. Importantly, the Vectorizer offers access to a number of vectorization pipelines that are common in stylometry, but much less so in other fields of machine learning. Let us load a larger corpus this time and test the vectorizer:


In [3]:
import glob
import os

authors, titles, texts = [], [], []
for filename in glob.glob('data/victorian_large/*.txt'):
    with open(filename, 'r') as f:
        text = f.read()
    author, title = os.path.basename(filename).replace('.txt', '').split('_')
    authors.append(author)
    titles.append(title)
    texts.append(text)

As you can see, we loop over the txt-files under the data/victorian_large directory and end up with three lists (authors, titles, and the actual texts), which can easily be zipped together:


In [5]:
for t, a in zip(titles, authors):
    print(t, 'by', a)


Agnes by ABronte
Tenant by ABronte
Emma by Austen
Pride by Austen
Sense by Austen
Jane by CBronte
Professor by CBronte
Villette by CBronte
Bleak by Dickens
David by Dickens
Hard by Dickens
Wuthering by EBronte
Adam by Eliot
Middlemarch by Eliot
Mill by Eliot
Joseph by Fielding
Tom by Fielding
Clarissa by Richardson
Pamela by Richardson
Sentimental by Sterne
Tristram by Sterne
Barry by Thackeray
Pendennis by Thackeray
Vanity by Thackeray
Barchester by Trollope
Phineas by Trollope
Prime by Trollope

Let us start with some basic preprocessing. The function preprocess() below lowercases each text and only retains alphabetic characters (and whitespace). Additionally, to speed things up a bit, we truncate each document after the first 200,000 characters:


In [6]:
def preprocess(text, max_len=200000):
    return ''.join([c for c in text.lower()
                        if c.isalpha() or c.isspace()])[:max_len]
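
Before applying it to the whole corpus, a quick check on a made-up snippet shows what the function does: punctuation and digits disappear, casing is flattened, and whitespace is preserved.

print(preprocess("Don't panic! It's 1984."))
# prints: dont panic its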

Let us now apply this function to the entire corpus:


In [7]:
for i in range(len(texts)):
    texts[i] = preprocess(texts[i])

We can now instantiate our vectorizer with some traditional settings; we will extract the 100 most frequent words and scale them using the per-column standard deviation. We limit this vectorizer to word unigrams (ngram_size=1) and we specify the 'culling' rule (cf. min_df) that words should be present in at least 70% of all texts:


In [8]:
vectorizer = Vectorizer(mfi=100,
                        vector_space='tf_std',
                        ngram_type='word',
                        ngram_size=1,
                        min_df=0.7)

We can now use this object to vectorize our lists of documents:


In [9]:
X = vectorizer.vectorize(texts)
print(X.shape)


(27, 100)

As requested, we have indeed obtained a two-dimensional matrix, with 100 feature columns for each of our 27 texts. To find out which words these columns correspond to, we can access the vectorizer's feature_names attribute:


In [10]:
print(vectorizer.feature_names)


['about', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been', 'before', 'but', 'by', 'can', 'could', 'did', 'do', 'for', 'from', 'good', 'great', 'had', 'has', 'have', 'he', 'her', 'him', 'his', 'how', 'if', 'in', 'into', 'is', 'it', 'its', 'know', 'like', 'little', 'made', 'man', 'may', 'me', 'might', 'miss', 'more', 'mr', 'mrs', 'much', 'must', 'my', 'never', 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 'other', 'out', 'own', 'said', 'say', 'see', 'she', 'should', 'so', 'some', 'such', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'they', 'think', 'this', 'time', 'to', 'too', 'up', 'upon', 'very', 'was', 'we', 'well', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'would', 'you', 'your']

Having such a module is great, but it also hides a lot of the interfacing with sklearn that is going on under the hood. In the next paragraphs, we will have a closer look at the preprocessing functionality which sklearn offers to deal with text.

Integer frequencies

Calculating absolute frequencies of, for instance, words in texts is something that is rarely done in stylometry. Because we often work with texts of unequal length, it is typically safer to back off to relative frequencies. Nevertheless, it is good to know that sklearn supports the extraction of absolute counts with its CountVectorizer:


In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_features=100)
X = vec.fit_transform(texts)
print(X.shape)


(27, 100)

Here, we immediately use the important max_features parameter, which controls how many of the most frequent words will be returned (cf. the second dimension of the vectorized matrix). Notice that all the vectorization methods discussed here are implemented as unsupervised methods in sklearn, so that they all have fit() and transform() methods (with fit_transform() chaining the two). One warning is important here: if we check the data type of the matrix being returned, we see that this is not a simple np.array:


In [12]:
type(X)


Out[12]:
scipy.sparse.csr.csr_matrix

We see that sklearn by default returns a so-called sparse matrix, which only explicitly stores non-zero values. While this is very efficient for larger datasets, many methods cannot deal with such sparse matrices - sklearn's PCA object, to give but one example. If memory is not much of an issue, it is safer to convert the sparse matrix back to a 'dense' array:


In [13]:
X = X.toarray()
type(X)


Out[13]:
numpy.ndarray
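
If you are unsure whether a dense representation will fit into memory, it can help to first check how sparse the matrix actually is. For a table restricted to the 100 most frequent words, the density is typically very high, since such words occur in virtually every text; for large character n-gram tables or low-frequency vocabularies it can be much lower. A quick check on the (now dense) array:

import numpy as np

# proportion of non-zero cells in the document-term matrix
print(np.count_nonzero(X) / X.size)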

Finally, to retrieve the names of the features which correspond to the columns in X, we can call the vectorizer's get_feature_names() method:


In [14]:
print(vec.get_feature_names())


['about', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been', 'before', 'but', 'by', 'can', 'could', 'did', 'do', 'for', 'from', 'good', 'great', 'had', 'has', 'have', 'he', 'her', 'him', 'his', 'how', 'if', 'in', 'into', 'is', 'it', 'its', 'know', 'like', 'little', 'made', 'man', 'may', 'me', 'might', 'miss', 'more', 'mr', 'mrs', 'much', 'must', 'my', 'never', 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 'other', 'out', 'own', 'said', 'say', 'see', 'she', 'should', 'so', 'some', 'such', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'they', 'think', 'this', 'time', 'to', 'too', 'up', 'upon', 'very', 'was', 'we', 'well', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'would', 'you', 'your']

Of course, with max_features set to 100, this list is dominated by function words, which are typically the most frequent items in a corpus of texts. Note that extracting binary features - which simply record the absence or presence of items in texts - is also supported, although this is of course even less common in current stylometry:


In [15]:
vec = CountVectorizer(max_features=50, binary=True)
X = vec.fit_transform(texts).toarray()
print(X)


[[1 1 1 ..., 1 1 1]
 [1 1 1 ..., 1 1 1]
 [1 1 1 ..., 1 1 1]
 ..., 
 [1 1 1 ..., 1 1 1]
 [1 1 1 ..., 1 1 1]
 [1 1 1 ..., 1 1 1]]

When using only the 50 most frequent features, it is only logical that these are present in virtually all of our texts.

Real-valued Frequencies

Working with relative frequencies is much more common in stylometry. Somewhat counter-intuitively, sklearn does not offer a dedicated vectorizer for this strategy. Rather, one must work with the TfidfVectorizer, which can be imported as follows:


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

Tfidf stands for term frequency/inverse document frequency. This particular vectorization method is one of the golden oldies in Information Retrieval: it gives more importance to rare words, by weighting the frequency of a 'term' (i.e. 'word') in a document by its inverse document frequency, that is, the inverse of the proportion of documents in the corpus which contain that term. Thus, the rarer a word is across the corpus, the more its importance will be boosted in this model. Note how this model in fact captures the inverse intuition of Burrows's Delta, which gives more weight to highly common words. Tfidf is not very common in stylometry, or authorship attribution in particular, although one could easily argue that it is not necessarily useless: if a rare word occurs in two anonymous texts, this does seem to increase the likelihood that both documents were authored by the same individual. In many ways, the TfidfVectorizer can be parametrized in the same way as the CountVectorizer, the main exception being that it will eventually yield a matrix of real numbers, instead of integers:


In [17]:
vec = TfidfVectorizer(max_features=10)
X = vec.fit_transform(texts).toarray()
print(vec.get_feature_names())
print(X)


['and', 'he', 'her', 'in', 'it', 'of', 'that', 'the', 'to', 'was']
[[ 0.52575819  0.12259351  0.1361753   0.18835796  0.14010687  0.31631271
   0.13188632  0.50681517  0.47214586  0.19407661]
 [ 0.5167485   0.07367298  0.18919775  0.17605421  0.16948244  0.30783545
   0.13939066  0.53750146  0.447572    0.16913656]
 [ 0.46035051  0.17061405  0.20904063  0.19904972  0.2071193   0.41692847
   0.13910424  0.45189666  0.44613267  0.19059587]
 [ 0.4025089   0.16591402  0.22729477  0.21167058  0.16182197  0.39618482
   0.17112208  0.5007181   0.46946971  0.17447012]
 [ 0.40231894  0.15788139  0.23737932  0.21620464  0.20023076  0.42980888
   0.13522077  0.47995944  0.43723859  0.21323275]
 [ 0.4786827   0.05292141  0.15295995  0.19563851  0.13964424  0.34620845
   0.10754996  0.61013265  0.38069273  0.19563851]
 [ 0.45124529  0.11094975  0.13440557  0.22525033  0.13738409  0.40768448
   0.12993779  0.59905419  0.34923109  0.18020027]
 [ 0.45466908  0.09994391  0.24455972  0.24039539  0.14423723  0.32406056
   0.13818003  0.57392033  0.36797531  0.23660964]
 [ 0.46997803  0.13418311  0.10451448  0.25285762  0.15980602  0.33579492
   0.16486317  0.62506404  0.32332061  0.15980602]
 [ 0.48980216  0.13955257  0.11697789  0.22540476  0.20111988  0.32117614
   0.20385621  0.55786824  0.35059163  0.23840231]
 [ 0.45429342  0.1733393   0.15383863  0.23075794  0.17875615  0.36906826
   0.1625056   0.58465902  0.34884534  0.16900582]
 [ 0.54154418  0.20681018  0.15386393  0.16594562  0.1407162   0.27645759
   0.12970054  0.57352513  0.39407644  0.12437038]
 [ 0.44400914  0.12235381  0.13864454  0.18058451  0.12096736  0.30709766
   0.1434971   0.65821496  0.3878581   0.15112255]
 [ 0.34645367  0.1676139   0.17922687  0.22490456  0.14864604  0.44516393
   0.2094206   0.54851938  0.40993791  0.19006564]
 [ 0.43735697  0.14742796  0.20692973  0.20200283  0.17509439  0.31380552
   0.15083889  0.59956562  0.38922497  0.20124485]
 [ 0.36303153  0.18876322  0.08268685  0.22038847  0.11925355  0.39762164
   0.14231363  0.65655338  0.37027898  0.12814815]
 [ 0.3124657   0.13829379  0.13209667  0.23027221  0.11905008  0.44945483
   0.13894612  0.63536866  0.39335452  0.1373153 ]
 [ 0.40169724  0.1943819   0.14188738  0.21302126  0.16242871  0.34387718
   0.24079011  0.42984648  0.57439662  0.13427947]
 [ 0.64599705  0.22787251  0.11823573  0.16517174  0.17412899  0.22536447
   0.15908081  0.29487275  0.52919448  0.1268347 ]
 [ 0.3734745   0.11876429  0.08289386  0.22336126  0.22727988  0.36593869
   0.09615687  0.68364823  0.33036969  0.13534306]
 [ 0.36108594  0.13216217  0.02899476  0.27005587  0.16958565  0.45953327
   0.15340252  0.63316469  0.30882793  0.1365451 ]
 [ 0.45840952  0.0943331   0.07121224  0.2167195   0.08755098  0.38226483
   0.12886024  0.64954193  0.31197743  0.19822281]
 [ 0.48541875  0.22037045  0.13101476  0.21221976  0.07667684  0.33357445
   0.11441151  0.64843251  0.26474642  0.1811264 ]
 [ 0.47757949  0.1403165   0.14157495  0.22054231  0.08116964  0.33380227
   0.10822618  0.66194154  0.29164439  0.17492371]
 [ 0.34791326  0.20964387  0.10183555  0.18187054  0.10093964  0.36194924
   0.17321003  0.67820687  0.35418466  0.18694738]
 [ 0.34282622  0.2994648   0.07960885  0.22290479  0.14465098  0.35773171
   0.24831188  0.54947048  0.42514516  0.20867683]
 [ 0.31347888  0.28813673  0.11907337  0.18919822  0.16281461  0.34506977
   0.24647841  0.54676549  0.4710862   0.20377863]]
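
To make the weighting scheme a bit more tangible, the toy function below (a hypothetical helper, not sklearn's actual implementation, which additionally smooths the idf term and l2-normalizes each document vector) shows the core tf-idf idea: the same relative frequency receives a larger weight when the term occurs in fewer documents.

import math

def toy_tfidf(term_count, doc_length, n_docs, docs_with_term):
    tf = term_count / doc_length             # relative term frequency
    idf = math.log(n_docs / docs_with_term)  # boost for rarer terms
    return tf * idf

# a term used 5 times in a 1000-word text, attested in 2 of 27 documents:
print(toy_tfidf(5, 1000, 27, 2))
# the same frequency for a term attested in all 27 documents is weighted down to 0:
print(toy_tfidf(5, 1000, 27, 27))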

To create a vector space that simply contains relative frequencies (which have not been weighted using IDFs), we can simply add the following parameter:


In [18]:
vec = TfidfVectorizer(max_features=10,
                      use_idf=False)
X = vec.fit_transform(texts).toarray()
print(vec.get_feature_names())


['and', 'he', 'her', 'in', 'it', 'of', 'that', 'the', 'to', 'was']

Of course, the list of features extracted is not altered by changing this argument, but the values themselves will have changed.
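
One caveat is worth flagging here: even with use_idf=False, TfidfVectorizer still l2-normalizes each document vector by default (norm='l2'), so the resulting values are scaled term frequencies rather than plain proportions. If you want each row to sum to one over the retained vocabulary, you can additionally pass norm='l1':

vec = TfidfVectorizer(max_features=10,
                      use_idf=False,
                      norm='l1')
X = vec.fit_transform(texts).toarray()
print(X.sum(axis=1))  # each row now sums to 1.0 (over the 10 retained words)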

Feature types

So far, we have only considered word frequencies as stylometric markers - where we naively define a word as a space-free string of alphabetic characters. Implicitly, we have been setting the analyzer argument to 'word':


In [19]:
vec = TfidfVectorizer(max_features=10,
                      analyzer='word')
X = vec.fit_transform(texts)
print(vec.get_feature_names())


['and', 'he', 'her', 'in', 'it', 'of', 'that', 'the', 'to', 'was']

It becomes clear, therefore, that sklearn is performing some sort of tokenization internally. Inconveniently, it also removes certain words: can you find out which?

To override this default behaviour, we need a little hack. One common solution is to create our own analyzer (i.e. tokenizer) function and pass that to our vectorizer:


In [20]:
def identity(x):
    # note: despite its name, this analyzer simply splits each text on whitespace
    return x.split()

vec = TfidfVectorizer(max_features=10,
                      analyzer=identity,
                      use_idf=False)
X = vec.fit_transform(texts)
print(vec.get_feature_names())


['a', 'and', 'he', 'i', 'in', 'of', 'that', 'the', 'to', 'was']

Does this solve our issue?
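
An alternative to rolling our own analyzer is to keep the built-in word analyzer but relax its token_pattern: by default, sklearn only retains tokens of at least two word characters (the default pattern is r'(?u)\b\w\w+\b'), which is why one-letter words such as 'a' and 'i' were silently dropped above. A sketch of this variant:

vec = TfidfVectorizer(max_features=10,
                      analyzer='word',
                      token_pattern=r'(?u)\b\w+\b',  # also keep single-character tokens
                      use_idf=False)
X = vec.fit_transform(texts)
print(vec.get_feature_names())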

Additionally, sklearn supports the extraction of character n-grams, which are also a common feature type in stylometry. Interestingly, it allows us to specify an ngram_range: can you figure out what it achieves? (Executing the block below might take a while...)


In [21]:
vec = TfidfVectorizer(max_features=10,
                      analyzer='char',
                      ngram_range=(2, 2))
X = vec.fit_transform(texts)
print(vec.get_feature_names())

vec = TfidfVectorizer(max_features=30,
                      analyzer='char',
                      ngram_range=(2, 3))
X = vec.fit_transform(texts)
print(vec.get_feature_names())


[' a', ' t', 'd ', 'e ', 'er', 'he', 'in', 's ', 't ', 'th']
[' a', ' h', ' i', ' m', ' o', ' s', ' t', ' th', ' w', 'an', 'd ', 'e ', 'en', 'er', 'ha', 'he', 'he ', 'in', 'n ', 'nd', 'o ', 'on', 'ou', 'r ', 're', 's ', 't ', 'th', 'the', 'y ']

Here, we have to watch out of course, because specifying such ranges will interfere with the max_features parameter. Because bigrams are much more frequent than trigrams, for instance, the trigrams might never make it into the frequency table if the max_features parameter isn't high enough! Naturally, we can gain more control over this extraction process by running two independent vectorizers and stacking their respective outcomes:


In [23]:
vec = TfidfVectorizer(max_features=50,
                      analyzer='char',
                      ngram_range=(2, 2))
X1 = vec.fit_transform(texts).toarray()

vec = TfidfVectorizer(max_features=100,
                      analyzer='char',
                      ngram_range=(3, 3))
X2 = vec.fit_transform(texts).toarray()

import numpy as np
print(X1.shape)
print(X2.shape)
X = np.hstack((X1, X2))
print(X.shape)


(27, 50)
(27, 100)
(27, 150)

Here, we finally obtain a single matrix which combines both feature types: 50 character bigram columns followed by 100 character trigram columns.
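
When stacking matrices like this, it is easy to lose track of which column corresponds to which feature. One option, sketched below with hypothetical variable names (vec_bi, vec_tri), is to keep both vectorizer objects around and concatenate their feature-name lists in the same order as the matrices:

import numpy as np

vec_bi = TfidfVectorizer(max_features=50, analyzer='char', ngram_range=(2, 2))
X1 = vec_bi.fit_transform(texts).toarray()

vec_tri = TfidfVectorizer(max_features=100, analyzer='char', ngram_range=(3, 3))
X2 = vec_tri.fit_transform(texts).toarray()

X = np.hstack((X1, X2))
feature_names = vec_bi.get_feature_names() + vec_tri.get_feature_names()
assert len(feature_names) == X.shape[1]  # one label per column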

Controlling the vocabulary

In this final section, it is worth discussing another set of parameters in the signatures of the sklearn vectorizers that are especially useful for stylometric research. Culling is a good issue to start with. Although 'culling' is used in a number of different senses, it typically means that we remove words which aren't distributed well enough over the texts in the corpus. If a specific word - e.g. a character's name - is extremely frequent in only one text, it might end up in our list of most frequent features, even though it doesn't generalize well to other texts. Through 'culling' we specify the minimum proportion of documents in which a feature should occur before it is allowed into the vectorizer's vocabulary. In the sklearn vectorizers, this culling property can be set using the min_df argument. Here we see that, of the 1000 columns we requested, only 615 remain because of the culling:


In [24]:
vec = TfidfVectorizer(max_features=1000, min_df=.95)
X = vec.fit_transform(texts)
print(X.shape[1])


615
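
For comparison, the same request without the min_df argument does return the full 1000 columns (the corpus contains far more than 1000 distinct word forms), which shows how much the culling threshold prunes the candidate vocabulary:

vec = TfidfVectorizer(max_features=1000)
X = vec.fit_transform(texts)
print(X.shape[1])  # 1000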

Likewise, it is also possible to specify a max_df, or the maximum proportion of documents in which an item is allowed to occur. This setting might be useful if you wish to shift the focus of your experiments away from function words alone, and also take into consideration some items from lower frequency strata.


In [25]:
vec = TfidfVectorizer(max_features=100, max_df=.40)
X = vec.fit_transform(texts)
print(vec.get_feature_names())


['ada', 'adam', 'adams', 'allworthy', 'amelia', 'archdeacon', 'arthur', 'barchester', 'barry', 'bennet', 'bessie', 'bingley', 'bishop', 'bounderby', 'brady', 'brooke', 'casaubon', 'catherine', 'celia', 'chancellor', 'collins', 'colonel', 'crawley', 'crimsworth', 'darcy', 'dashwood', 'dobbin', 'dorothea', 'dr', 'duke', 'edward', 'elinor', 'eliza', 'elizabeth', 'elton', 'emma', 'erle', 'finn', 'fitzgibbon', 'fleur', 'glegg', 'gradgrind', 'graham', 'grantly', 'harding', 'harlowe', 'harriet', 'hath', 'heathcliff', 'helen', 'james', 'jarndyce', 'jellyby', 'jervis', 'joseph', 'kennedy', 'knightley', 'la', 'laura', 'le', 'linton', 'lopez', 'louisa', 'lovelace', 'maggie', 'major', 'marianne', 'monsieur', 'murdstone', 'osborne', 'pamela', 'paris', 'parliament', 'partridge', 'peggotty', 'pendennis', 'phineas', 'pitt', 'proudie', 'pullet', 'pupils', 'quin', 'quoth', 'rebecca', 'reed', 'richard', 'roby', 'schoolroom', 'sedley', 'seth', 'slope', 'temple', 'toby', 'trim', 'tulliver', 'weston', 'wharton', 'wi', 'willoughby', 'woodhouse']

As you can see, max_df takes us away from the high-frequency function words, with a lot of proper nouns coming through. By the way: make sure that you specify min_df and max_df as floats if you mean proportions: if you specify them as integers, sklearn will interpret these numbers as the minimum or maximum number of individual documents in which a term should occur.
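
A minimal illustration of the two interpretations; with 27 documents in the corpus, the two settings below are effectively equivalent (0.95 x 27 = 25.65, which in practice means at least 26 documents):

# float: keep features occurring in at least 95% of the documents
vec = TfidfVectorizer(max_features=1000, min_df=0.95)
# int: keep features occurring in at least 26 individual documents
vec = TfidfVectorizer(max_features=1000, min_df=26)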

Finally, it is good to know that we can also manually specify vocabularies through the vocabulary argument. This way, we can exercise much tighter control over which words go into a procedure - and manually remove words from a previous analysis, if necessary.


In [227]:
vec = TfidfVectorizer(vocabulary=('my', 'i', 'we'))
X = vec.fit_transform(texts)
print(vec.get_feature_names())


['my', 'i', 'we']
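
To make this concrete, a common pattern is to recycle the vocabulary from an earlier run and filter out unwanted items before re-vectorizing. The sketch below, for instance, drops the personal titles 'mr', 'mrs' and 'miss' (which did surface among the 100 most frequent words earlier) before building a new matrix:

unwanted = {'mr', 'mrs', 'miss'}

vec = TfidfVectorizer(max_features=100)
vec.fit(texts)
vocab = [w for w in vec.get_feature_names() if w not in unwanted]

vec = TfidfVectorizer(vocabulary=vocab)
X = vec.fit_transform(texts)
print(X.shape)  # three columns fewer than the full top 100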